H. Hermansky: "AUDITORY MODELING IN AUTOMATIC RECOGNITION OF SPEECH"
Author
Abstract
The paper argues against the blind implementation of scattered, accidental knowledge that may be irrelevant to a speech recognition task, and advances the notion that the reason for applying knowledge of human auditory perception in engineering applications should be the ability of perception to suppress some parts of the information in the speech message. In general, it advocates selective use of auditory knowledge, optimized on real speech data.

1 Ignorance and Knowledge in Handling the Nonlinguistic Variability in ASR

With the advent of powerful stochastic classification techniques, many believe that the task of alleviating the irrelevant variability should be left to the classifier. In the training stage, the recognizer would be presented with all the information that is available (i.e., both "signal" and "noise"). Such exhaustive training should allow (during recognition) for a separation of the desired signal from the noise. However, for this to be true, the classifier would have to be a model-free classifier (so that no prior knowledge constrains its capabilities) and it would have to be presented with all possible past and future speech data. Therefore, weakly structured data-driven ASR systems require large amounts of training data. Further, the knowledge acquired by the training is typically specific to a given application and may not be easily reusable in another system. Some speculate that more structured designs with some built-in knowledge of human speech communication processes could provide desirable constraints for the ASR problem. As will become more obvious after reading through this paper, we are currently advocating selective use of auditory knowledge optimized on real speech data.

2 Analysis in ASR

The information rate in the speech signal is, by some estimates [1], of the order of 36 kbits/s. The written equivalent of the linguistic message in the signal is less than 50 bits/s [1]. The general task of ASR is to identify and reduce information in the signal. The analysis should support this goal by alleviating, as much as possible, non-linguistic factors in the signal. Additional practical requirements for current ASR techniques are that speech features derived from speech analysis should be low-dimensional and their statistics should be well described by Gaussian distributions. Automatic speech recognition (ASR) typically uses features based on a short-term spectrum of speech, which describes the time-varying speech signal as a sequence of short-term feature vectors. Each vector reflects properties of a relatively short (10-20 ms) segment of the signal. Each individual feature vector is …
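The short-term analysis mentioned in Section 2 can be illustrated with a minimal sketch. It is not the paper's method: the 20 ms frame length, 10 ms hop, Hamming window, naive DFT, and all function names (`frame_signal`, `power_spectrum`, `short_term_features`) are assumptions chosen for illustration. It only shows how a time-varying signal becomes a sequence of short-term spectral vectors.

```python
import cmath
import math


def frame_signal(samples, sample_rate, frame_ms=20.0, hop_ms=10.0):
    """Split a signal into overlapping short-term frames (assumed 20 ms / 10 ms hop)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


def power_spectrum(frame):
    """Power spectrum of one Hamming-windowed frame, computed with a naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequencies only
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def short_term_features(samples, sample_rate):
    """One spectral vector per frame: the 'sequence of short-term feature vectors'."""
    return [power_spectrum(f) for f in frame_signal(samples, sample_rate)]


# Hypothetical input: 100 ms of a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / SAMPLE_RATE) for t in range(800)]
features = short_term_features(tone, SAMPLE_RATE)
```

With these assumed settings, each frame holds 160 samples, so the frequency bins are 50 Hz apart and the 1 kHz tone falls on bin 20 of every vector. A practical front end would use an FFT and would typically compress each spectrum further into a low-dimensional vector, in line with the requirement stated above.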